The purpose of this package is to aid in interpretation of rich and complex enrichment or GSEA results. The first step is to cluster gene sets based on the overlap coefficient, which is a metric of similarity that does not punish differences in gene set size. This allows child sets in hierarchical references like GO to cluster with their parents. The package then allows various visualizations of clusters to help summarize biological functions that might be contained within a long list of significantly enriched gene sets.
The package can be installed from GitHub using:
install.packages("devtools")
devtools::install_github("BIGslu/madRich")
library(madRich)
This is a hypergeometric enrichment calculated with the SEARchways package. It uses a gene list from that package, and performs an enrichment against the C2: Canonical Pathways and C5: GO Biological Processes from the Broad Institute’s Molecular Signatures Database (MSigDB).
gene_list2 <- list(HRV1 = names(SEARchways::example.gene.list[[1]]),
HRV2 = names(SEARchways::example.gene.list[[2]]))
df1 <- SEARchways::BIGprofiler(gene_list=gene_list2,
collection="C2", subcollection ="CP", ID="ENSEMBL")
#> [1] "HRV1"
#>
#> [1] "HRV2"
df2 <- SEARchways::BIGprofiler(gene_list=gene_list2,
collection="C5", subcollection="GO:BP", ID="ENSEMBL")
#> [1] "HRV1"
#> [1] "HRV2"
enrichment <- dplyr::bind_rows(df1, df2)
This results looks like this:
head(enrichment)
#> # A tibble: 6 × 14
#> group n_query_genes gs_collection gs_subcollection n_background_genes pathway
#> <chr> <int> <chr> <chr> <dbl> <chr>
#> 1 HRV1 100 C2 CP:KEGG_MEDICUS 15461 KEGG_ME…
#> 2 HRV1 100 C2 CP:REACTOME 15461 REACTOM…
#> 3 HRV1 100 C2 CP:WIKIPATHWAYS 15461 WP_MITO…
#> 4 HRV1 100 C2 CP:WIKIPATHWAYS 15461 WP_NICO…
#> 5 HRV1 100 C2 CP:REACTOME 15461 REACTOM…
#> 6 HRV1 100 C2 CP:REACTOME 15461 REACTOM…
#> # ℹ 8 more variables: n_pathway_genes <dbl>, n_query_genes_in_pathway <dbl>,
#> # `k/K` <dbl>, pval <dbl>, FDR <dbl>, qvalue <dbl>, genes <list>,
#> # pathway_GOID <chr>
Whatever your enrichment format, the following steps require a data frame with the following columns: pathway, k/K (for hypergeometric) or NES (for GSEA), FDR (if applying a cutoff), collection & subcollection (if not providing a custom database). If you are enriching against a custom database rather than Broad databases, you must provide this as a data frame with the columns “gs_name” for pathways and “gene” for member genes.
This function creates a list of objects to help guide the choice of tree cut height when finding gene set clusters with clusterSets(). It requires an enrichment or GSEA input. For hypergeometric enrichment, it will provide a single plot and “best” cut height value. For GSEA, it will provide a sign-separated result (one plot and height for positive NES pathways, one for negative).
The function requires either (A) a vector of Broad gene set collections represented in the enrichment result with a named vector of subcollections, or (B), a custom database frame with the columns “gs_name” for pathways and “gene” for member genes.
Also required is a vector of tree cut heights for hclust to calculate values of K and sillhouette scores.
cut_height_test1 <- chooseCutHeight(df = enrichment,
enrich_method="hypergeometric",
ID = "ENSEMBL",
collections = c("C5", "C2"),
subcollections = c("C2" = "CP", "C5" = "GO:BP"),
hclust_heights = c(0.3,0.5,0.7,0.9),
group_name = "HRV1",
fdr_cutoff = 0.4,
kK_cutoff = 0.03)
The first is the formatted database with only the pathways passing any applied strength/significance cutoffs
head(cut_height_test1$database_format)
#> # A tibble: 6 × 4
#> sign pathway gene gs_description
#> <chr> <chr> <chr> <chr>
#> 1 pathways GOBP_AXON_EXTENSION ENSG00000097007 Long distance growth of a single…
#> 2 pathways GOBP_AXON_EXTENSION ENSG00000143199 Long distance growth of a single…
#> 3 pathways GOBP_AXON_EXTENSION ENSG00000101126 Long distance growth of a single…
#> 4 pathways GOBP_AXON_EXTENSION ENSG00000170017 Long distance growth of a single…
#> 5 pathways GOBP_AXON_EXTENSION ENSG00000176248 Long distance growth of a single…
#> 6 pathways GOBP_AXON_EXTENSION ENSG00000130203 Long distance growth of a single…
The second is a summary of k at the provided cut heights (sign separated in the case of GSEA input).
cut_height_test1$summary_df
#> cut_height k_at_height avg_silhouette_width sign
#> 1 0.3 134 0.3089300 NA
#> 2 0.5 114 0.3108784 NA
#> 3 0.7 94 0.3098215 NA
#> 4 0.9 46 0.3130909 NA
The third is a plot showing average silhouette score by number of clusters.
cut_height_test1$silhouette_score_plot
And the last is a “best” cut height selected from the input vector of
possible cut heights that maximized silhouette width.
cut_height_test1$best_height
#> [1] 0.9
It may benefit to run this function through several times, narrowing in on a cut height that feels best. In our first example, 0.9 was the “best” height. According to our plot, across all possible values of k the average silhouette width is optimized at k = 122 (height between 0.3 and 0.5). BUT, there is also a plateau that matters here. There is not a ton if difference in silhouette score between k ~ 40 and k ~ 140. But reducing our gene set list from ~ 200 sets to ~ 140 clusters isn’t super helpful. Dropping from ~ 200 sets to ~ 20 clusters WOULD be. So let’s try several values above 0.9 to see if we can get a lower and more useful value for k.
cut_height_test2 <- chooseCutHeight(df = enrichment,
enrich_method="hypergeometric",
ID = "ENSEMBL",
collections = c("C5", "C2"),
subcollections = c("C2" = "CP", "C5" = "GO:BP"),
hclust_heights = c(0.95, 0.96, 0.97),
group_name = "HRV1",
fdr_cutoff = 0.4,
kK_cutoff = 0.03)
cut_height_test2$best_height
#> [1] 0.96
A cut height of 0.96 gets us k = 30, which is pretty close to a local
maximum on the plot. So I am going to go with that. You may have reasons
to choose a value that is NOT the overall (or even a local) maximum
silhouette score value. You may find that you need more clusters to
break down a complicated enrichment. Or the “optimal” value may provide
too FEW clusters to be useful for your analysis. You can see that there
are several local maxima along the curve of values of k. These are
relatively better clustering solutions than their neighbors, so try to
choose a local maximum even if deviating from the absolute maximum. And
keep in mind the possible diminishing returns of adding more clusters
when your plot has a large plateau like this one.
This function will generate the final clusters. Its inputs are very similar to the previous, and the output will provide the data for all the visualizations in this package.
cluster_solution <- clusterSets(df = enrichment,
enrich_method="hypergeometric",
ID = "ENSEMBL",
collections = c("C5", "C2"),
subcollections = c("C2" = "CP", "C5" = "GO:BP"),
hclust_height = cut_height_test2$best_height,
group_name = "HRV1",
fdr_cutoff = 0.4,
kK_cutoff = 0.03)
The dendrogram shows the tree generated in hclust, with node colors and numbers showing cluster assignment of each gene set.
clusterDendro(cluster_solution)
The scatterplot shows a dimension reduction (tSNE, UMAP, or PCoA) of clustered sets. The size of circles in the scatterplot can indicate significance (-log10(pval)) or gene set size. I think tSNE does the best job of visually dispersing the points, but your ability to use tSNE or UMAP may depend on the underlying variability in your particular clustering solution.
clusterScatter(cluster_solution, dimred = "tSNE", scores = "pvalue")
The treemap plot shows gene set clusters together and sizes boxes based on either gene set size or -log10(pval). It is a good way to visualize the clusters, their member sets, and their relative significance. In this case, I can see that my cluster 1 is a large and possibly informative cluster. So I will focus on that one for the next visualization.
clusterTreemap(cluster_solution, scores = "pvalue")
These plots take text from gene set names and descriptions and generate word clouds.
wordclouds <- clusterWordclouds(cluster_solution)
#> Joining with `by = join_by(pathway)`
So wordclouds$wordcloud_names[[“1]] would have word clouds made from the gene set NAMES of sets in cluster 1. The function removes common words like articles, but additional words can be removed with the “rmwords” parameter in clusterWordclouds.
wordclouds$wordcloud_names[["1"]]
This gives us some insight into what biological function might be
represented by this subset of our enrichment result. There is also a
“wordcloud_descriptions” list which creates wordclouds based on the
gs_description column in the Broad databases. And “wordclouds_both” list
which uses both set names and their descriptions. Note that these word
clouds are only useful for a custom database if you have well-curated
gs_name and gs_description columns in the input.
wordclouds$wordcloud_names[["18"]]
wordclouds$wordcloud_descriptions[["18"]]
wordclouds$wordcloud_both[["18"]]
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.2
#>
#> Matrix products: default
#> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Los_Angeles
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] madRich_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 ggdendro_0.2.0 rstudioapi_0.17.1
#> [4] jsonlite_2.0.0 magrittr_2.0.4 ggtangle_0.0.9
#> [7] farver_2.1.2 rmarkdown_2.30 fs_1.6.6
#> [10] vctrs_0.6.5 memoise_2.0.1 ggtree_3.16.3
#> [13] rstatix_0.7.3 htmltools_0.5.9 curl_7.0.0
#> [16] broom_1.0.11 Formula_1.2-5 gridGraphics_0.5-1
#> [19] sass_0.4.10 bslib_0.9.0 plyr_1.8.9
#> [22] cachem_1.1.0 ggfittext_0.10.3 commonmark_2.0.0
#> [25] igraph_2.2.1 lifecycle_1.0.4 iterators_1.0.14
#> [28] pkgconfig_2.0.3 Matrix_1.7-4 R6_2.6.1
#> [31] fastmap_1.2.0 gson_0.1.0 GenomeInfoDbData_1.2.14
#> [34] digest_0.6.39 aplot_0.2.9 enrichplot_1.28.4
#> [37] colorspace_2.1-2 patchwork_1.3.2 AnnotationDbi_1.70.0
#> [40] S4Vectors_0.46.0 textshaping_1.0.4 RSQLite_2.4.5
#> [43] ggpubr_0.6.2 labeling_0.4.3 httr_1.4.7
#> [46] abind_1.4-8 compiler_4.5.2 bit64_4.6.0-1
#> [49] withr_3.0.2 S7_0.2.1 backports_1.5.0
#> [52] BiocParallel_1.42.2 carData_3.0-5 DBI_1.2.3
#> [55] R.utils_2.13.0 ggsignif_0.6.4 MASS_7.3-65
#> [58] rappdirs_0.3.3 tools_4.5.2 otel_0.2.0
#> [61] ape_5.8-1 msigdbr_25.1.1 R.oo_1.27.1
#> [64] glue_1.8.0 treemapify_2.6.0 nlme_3.1-168
#> [67] GOSemSim_2.34.0 gridtext_0.1.5 grid_4.5.2
#> [70] Rtsne_0.17 cluster_2.1.8.1 reshape2_1.4.5
#> [73] fgsea_1.34.2 generics_0.1.4 gtable_0.3.6
#> [76] R.methodsS3_1.8.2 tidyr_1.3.2 data.table_1.17.8
#> [79] SEARchways_1.5.1 xml2_1.5.1 car_3.1-3
#> [82] utf8_1.2.6 XVector_0.48.0 BiocGenerics_0.54.1
#> [85] markdown_2.0 ggrepel_0.9.6 foreach_1.5.2
#> [88] pillar_1.11.1 stringr_1.6.0 yulab.utils_0.2.2
#> [91] babelgene_22.9 splines_4.5.2 dplyr_1.1.4
#> [94] treeio_1.32.0 lattice_0.22-7 bit_4.6.0
#> [97] tidyselect_1.2.1 GO.db_3.21.0 Biostrings_2.76.0
#> [100] knitr_1.51 litedown_0.9 IRanges_2.42.0
#> [103] svglite_2.2.2 stats4_4.5.2 xfun_0.55.3
#> [106] Biobase_2.68.0 factoextra_1.0.7 stringi_1.8.7
#> [109] UCSC.utils_1.4.0 lazyeval_0.2.2 ggfun_0.2.0
#> [112] yaml_2.3.12 evaluate_1.0.5 codetools_0.2-20
#> [115] ggwordcloud_0.6.2 tibble_3.3.0 qvalue_2.40.0
#> [118] ggplotify_0.1.3 cli_3.6.5 systemfonts_1.3.1
#> [121] jquerylib_0.1.4 Rcpp_1.1.0 GenomeInfoDb_1.44.3
#> [124] png_0.1-8 parallel_4.5.2 ggplot2_4.0.1
#> [127] assertthat_0.2.1 blob_1.2.4 clusterProfiler_4.16.0
#> [130] DOSE_4.2.0 tidytree_0.4.6 scales_1.4.0
#> [133] purrr_1.2.0 crayon_1.5.3 scico_1.5.0
#> [136] rlang_1.1.6 cowplot_1.2.0 fastmatch_1.1-6
#> [139] KEGGREST_1.48.1